Abstract

In this work, we address the problem of how to recover the optimal solution to the optimization
problem related to high dimensional data classification using random projection, to
which we refer as Recovery of Optimal Solution. This is in contrast to the previous studies
that were focused on analyzing the classification performance using random projection. We
reveal the relationship between compressive sensing and the problem of recovering optimal
solution using random projection. We also present a simple algorithm, termed as Dual
Random Projection, that recovers the optimal solution with a small error by computing
dual solution provided that the data matrix is of low rank.

Keywords: Random projection, Primal solution, Dual solution, Low rank

1. Introduction 






Random projection has been widely used in many machine learning tasks, including classification
(Arriaga and Vempala, 1999; Vempala, 2004; Fradkin and Madigan, 2003; Balcan
and Blum, 2005; Blum, 2006; Rahimi and Recht, 2008), regression (Maillard and Munos,
2012; Drineas et al., 2008), clustering (Kaski, 1998; Fern and Brodley, 2003; Boutsidis et al.,
2010), dimensionality reduction (Kaski, 1998; Bingham and Mannila, 2001), manifold learning
(Dasgupta and Freund, 2008; Freund et al., 2008), and information retrieval (Goel et al.,
2005). In this work, we focus on random projection for classification.

Many studies were devoted to analyzing the classification performance using random
projection. In this paper, we examine the effect of random projection for data classification
from a very different aspect. In particular, we are interested in accurately recovering the
optimal solution to the original optimization problem related to data classification using
random projection. This is particularly useful for feature selection (Guyon and Elisseeff,
2003), where important features are often selected based on their weights in the linear
prediction model learned from the training data. In this case, it is insufficient to simply
guarantee a low classification error for the learned prediction model based on random projection.
In order to ensure that similar features are selected by the prediction model based
on random projection, it is important to guarantee that the recovered solution based on
random projection is close to the one obtained by solving the original optimization problem
without random projection.

The rest of the draft is arranged as follows: Section 2 describes the problem of recovering 
optimal solution by random projection, the center of this work. Section 3 describes the dual 
random projection approach for reconstructing optimal solutions. Section 4 presents the 
main theoretical results for the proposed algorithm. Section 5 presents the proof for the 
theorems stated in Section 4. Section 6 concludes this work with open questions. 



2. The Problem of Recovering Optimal Solutions from Random 
Projection 

Let $(x_i, y_i), i = 1, \ldots, n$ be a set of training examples, where $x_i \in \mathbb{R}^d$ is a vector of $d$
dimension and $y_i \in \{-1, +1\}$ is the binary class assignment for $x_i$. Let $X = (x_1, \ldots, x_n)$
and $y = (y_1, \ldots, y_n)^\top$ include input patterns and the class assignments of all training
examples. A classifier $w \in \mathbb{R}^d$ is learned from the training examples by solving the following
optimization problem:

$$\min_{w \in \mathbb{R}^d} \frac{\lambda}{2}\|w\|_2^2 + \sum_{i=1}^n \ell(y_i x_i^\top w) \tag{1}$$

where $\ell(z)$ is a convex loss function that is differentiable$^1$. By writing $\ell(z)$ in its convex
conjugate form, i.e.

$$\ell(z) = \max_{\alpha \in \Omega} \alpha z - \ell_*(\alpha),$$

where $\ell_*(\alpha)$ is the convex conjugate of $\ell(z)$ and $\Omega$ is the domain of the dual variable $\alpha$, we have
the dual optimization problem

$$\max_{\alpha \in \Omega^n} -\sum_{i=1}^n \ell_*(\alpha_i) - \frac{1}{2\lambda}\alpha^\top G \alpha \tag{2}$$

where $\alpha = (\alpha_1, \ldots, \alpha_n)^\top$, $D(y) = \mathrm{diag}(y)$, and $G$ is the Gram matrix given by

$$G = D(y) X^\top X D(y) \tag{3}$$

In the following, we denote by $w_*$ the optimal primal solution to (1), and by $\alpha_*$ the optimal
dual solution to (2). The following proposition connects $w_*$ and $\alpha_*$.

Proposition 1 Let $w_*$ be the optimal primal solution to (1), and $\alpha_*$ be the optimal dual
solution to (2). Then we have

$$w_* = -\frac{1}{\lambda} X D(y) \alpha_*, \quad \text{and} \quad [\alpha_*]_i = \nabla \ell(y_i x_i^\top w_*), \; i = 1, \ldots, n \tag{4}$$
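
The two identities in Proposition 1 can be checked numerically on a small example. The sketch below is only an illustration under an assumption that is not part of the text: it picks the smooth loss $\ell(z) = (1 - z)^2/2$, for which the primal problem (1) has a closed-form solution. The function name `primal_dual_check` and all sizes are hypothetical.

```python
import numpy as np

def primal_dual_check(d=30, n=50, lam=0.5, seed=0):
    """Check Proposition 1 for the (assumed) smooth loss l(z) = (1 - z)^2 / 2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d, n))          # columns are the examples x_i
    y = rng.choice([-1.0, 1.0], size=n)      # labels in {-1, +1}

    # Primal optimum: lam*w + sum_i l'(y_i x_i^T w) y_i x_i = 0 reduces to
    # (lam*I + X X^T) w = X y for this loss, because y_i^2 = 1.
    w_star = np.linalg.solve(lam * np.eye(d) + X @ X.T, X @ y)

    # Second identity in (4): alpha_i = l'(y_i x_i^T w_*) with l'(z) = z - 1.
    alpha_star = y * (X.T @ w_star) - 1.0

    # First identity in (4): w_* = -(1/lam) X D(y) alpha_*.
    w_from_dual = -(X @ (y * alpha_star)) / lam
    return w_star, w_from_dual

w_star, w_from_dual = primal_dual_check()
print(np.max(np.abs(w_star - w_from_dual)))  # difference at round-off level
```

For other differentiable losses the same check applies, but the primal solution would have to come from an iterative solver instead of a linear system.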



1. For non differentiable loss functions such as hinge loss, we could apply the smoothing technique (Nesterov, 
2005) to make it differentiable. 






Recovering Optimal Solution by Dual Random Projection 



The proof of Proposition 1 and other omitted proofs are deferred to the Appendix. When
dimension $d$ is high and the number of training examples $n$ is large, solving either the primal
problem in (1) or the dual problem in (2) can be computationally expensive. To reduce the
computational cost, one common approach is to significantly reduce the dimensionality by
random projection. Let $S \in \mathbb{R}^{d \times m}$ be a Gaussian random matrix, where each entry $S_{ij}$ is
independently drawn from a Gaussian distribution $\mathcal{N}(0, 1)$ and $m$ is significantly smaller
than $d$. Using random matrix $S$, we generate a new data representation for input data
points by

$$\hat{x}_i = \frac{1}{\sqrt{m}} S^\top x_i, \tag{5}$$

and we solve the following problem in the projected space:

$$\min_{z \in \mathbb{R}^m} \frac{\lambda}{2}\|z\|_2^2 + \sum_{i=1}^n \ell(y_i z^\top \hat{x}_i) \tag{6}$$

The corresponding dual problem is written as

$$\max_{\alpha \in \Omega^n} -\sum_{i=1}^n \ell_*(\alpha_i) - \frac{1}{2\lambda}\alpha^\top \hat{G} \alpha \tag{7}$$

where

$$\hat{G} = D(y) X^\top \frac{S S^\top}{m} X D(y) \tag{8}$$



Remark 2 The choice of the Gaussian random matrix $S$ is motivated by the fact that the
expectation of the dot-product of any two examples in the projected space is equal to the
dot-product in the original space, i.e.,

$$\mathrm{E}[\hat{x}_i^\top \hat{x}_j] = x_i^\top \mathrm{E}\left[\frac{1}{m} S S^\top\right] x_j = x_i^\top x_j,$$

where the last equality follows from $\mathrm{E}[S S^\top / m] = I$.
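
The identity in Remark 2 can be illustrated with a small Monte Carlo experiment. This is a minimal sketch, not part of the original analysis; the helper `mean_projected_dot`, the dimensions, and the number of trials are all illustrative assumptions.

```python
import numpy as np

def mean_projected_dot(x, xp, m=50, trials=2000, seed=0):
    """Average of <S^T x, S^T x'> / m over independent Gaussian matrices S."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    S = rng.standard_normal((trials, d, m))            # one d-by-m matrix per trial
    px = np.einsum("tdm,d->tm", S, x) / np.sqrt(m)     # projected x for each trial
    pxp = np.einsum("tdm,d->tm", S, xp) / np.sqrt(m)   # projected x' for each trial
    return float(np.mean(np.sum(px * pxp, axis=1)))

rng = np.random.default_rng(1)
x = rng.standard_normal(20)
x /= np.linalg.norm(x)
xp = rng.standard_normal(20)
xp /= np.linalg.norm(xp)

exact = float(x @ xp)
estimate = mean_projected_dot(x, xp)
print(exact, estimate)   # the two numbers should nearly agree
```

A single draw of $S$ only preserves the dot-product up to a fluctuation of order $1/\sqrt{m}$; it is the average over draws that matches the original value.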



Let $z_*$ denote the optimal solution to the primal problem (6) in the projected space,
and $\hat{\alpha}_*$ denote the optimal dual solution to (7). Similar to Proposition 1, the following
proposition connects $z_*$ and $\hat{\alpha}_*$.

Proposition 3 We have

$$z_* = -\frac{1}{\lambda}\frac{1}{\sqrt{m}} S^\top X D(y) \hat{\alpha}_*, \quad \text{and} \quad [\hat{\alpha}_*]_i = \nabla \ell\left(\frac{y_i}{\sqrt{m}} x_i^\top S z_*\right), \; i = 1, \ldots, n \tag{9}$$

Given the optimal solution $z_* \in \mathbb{R}^m$, the data point $x \in \mathbb{R}^d$ is classified by $x^\top S z_* / \sqrt{m}$,
which is equivalent to defining a new solution $\hat{w} \in \mathbb{R}^d$ given below, to which we refer as the
naive solution,

$$\hat{w} = \frac{1}{\sqrt{m}} S z_* \tag{10}$$

The classification performance of $\hat{w}$ has been examined by many studies (e.g., Arriaga and
Vempala, 1999; Shi et al., 2012; Balcan et al., 2006; Maillard and Munos, 2012). The
general conclusion is that when the original data is linearly separable with a large margin,
the classification error for the solution based on random projection is usually small.

Although these studies show that $\hat{w}$ can achieve a small classification error under appropriate
assumptions, it is unclear whether $\hat{w}$ is a good approximation of the true optimal
solution $w_*$. In fact, as we will see in Section 4, $\hat{w}$ is almost guaranteed to be
a bad approximation of $w_*$ (i.e., $\|\hat{w} - w_*\|_2 = \Omega(\|w_*\|_2)$). This observation leads to an
interesting question: is it possible to accurately recover the optimal solution $w_*$ based on $z_*$,
the random projection based solution? We refer to this problem as Recovery of Optimal
Solution.

Relationship to Compressive Sensing The proposed problem is closely related to
compressive sensing (Candes and Wakin, 2008; Donoho, 2006) where the goal is to recover
a high dimensional but sparse vector using a small number of random projections. The key
difference between our work and compressive sensing is that we do not have direct access
to the random measurements of the target vector (which in our case is $w_*$). Instead, $z_*$
is the optimal solution to (6), the primal problem using random projection. However, the
following theorem shows that $z_*$ is a good approximation of $S^\top w_* / \sqrt{m}$, which includes $m$
random measurements of $w_*$, if the data matrix $X$ is of low rank and the number of random
measurements $m$ is sufficiently large.

Theorem 1 With a probability at least $1 - \delta - \exp(-m/32)$, we have

$$\|\sqrt{m} z_* - S^\top w_*\|_2 \le \frac{2\sqrt{2}\,\epsilon}{\sqrt{1 - \epsilon}} \|S^\top w_*\|_2$$

provided

$$m \ge \frac{r\left(\log(r^2 + r) + \log(1/\delta)\right)}{c\epsilon^2}$$

where constant $c$ is at least 1/32, and $r$ is the rank of $X$.

Given the approximation bound in Theorem 1, it is appealing to reconstruct $w_*$ using
the compressive sensing algorithm provided that $w_*$ is sparse with respect to certain bases. We note
that the low rank assumption for data matrix $X$ implies that $w_*$ is sparse with respect
to the singular vector system of $X$. However, since $z_*$ only provides an approximation to
the random measurements of $w_*$, running the compressive sensing algorithm will not be
able to perfectly recover $w_*$ from $z_*$. In Section 3, we present an algorithm that recovers
$w_*$ with a small error provided that the data matrix $X$ is of low rank. Compared to
the compressive sensing algorithm, the main advantage of the proposed algorithm is its
computational simplicity because it does not need to compute the eigenvectors of $X$ and
solve an optimization problem that minimizes the $\ell_1$ norm.
Algorithm 1 A Dual Random Projection Approach for Recovering Optimal Solution
1: Input: input patterns $X \in \mathbb{R}^{d \times n}$, binary class assignment $y \in \{-1, +1\}^n$, and sample
size $m$
2: Sample a Gaussian random matrix $S \in \mathbb{R}^{d \times m}$
3: Compute the projected data matrix as $\hat{X} = S^\top X / \sqrt{m}$
4: Compute $z_*$ by solving the primal problem (6) and construct $\hat{\alpha}_*$ by Proposition 3
5: Output: the recovered solution $\tilde{w} = -X D(y) \hat{\alpha}_* / \lambda$



3. Algorithm 

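To motivate our algorithm, let us revisit the optimal primal solution $w_*$ to (1), which is
given in Proposition 1, i.e.,

$$w_* = -\frac{1}{\lambda} X D(y) \alpha_* \tag{11}$$

where $\alpha_*$ is the optimal solution to the dual problem (2). Given the projected data $\hat{x} =
S^\top x / \sqrt{m}$, we have reached an approximate dual problem in (7). Comparing it with the
dual problem in (2), and noticing that $\mathrm{E}[S S^\top / m] = I$, we see that (7) differs from (2) only
in a Gram matrix whose expectation is $G$. As a result, when the number of
random projections $m$ is sufficiently large, we would expect $\hat{\alpha}_*$ to be close to $\alpha_*$. We can
therefore use $\hat{\alpha}_*$ as an approximation of $\alpha_*$ in (11), which yields a recovered prediction model,
denoted by $\tilde{w}$:

$$\tilde{w} = -\frac{1}{\lambda} X D(y) \hat{\alpha}_* = -\frac{1}{\lambda} \sum_{i=1}^n y_i [\hat{\alpha}_*]_i x_i \tag{12}$$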

Remark 4 Note that the key difference between the recovered solution $\tilde{w}$ and the naive
solution $\hat{w}$ is that $\hat{w}$ is computed by projecting the optimal primal solution $z_*$ in the projected
space back to the original space via $S$, while $\tilde{w}$ is computed directly in the original space
using the approximate dual solution $\hat{\alpha}_*$. As a result, the naive solution $\hat{w}$ lies in the subspace
spanned by the column vectors in random matrix $S$ (denoted by $\mathcal{A}_S$), while the recovered
solution $\tilde{w}$ lies in the subspace that also contains the optimal solution $w_*$, i.e., the subspace
spanned by columns of $X$ (denoted by $\mathcal{A}$). The mismatch between spaces $\mathcal{A}_S$ and $\mathcal{A}$ leads
to the large approximation error for $\hat{w}$.
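
The contrast drawn in Remark 4 can be seen on synthetic low-rank data. The following is a minimal sketch of Algorithm 1, assuming the squared loss $\ell(z) = (1 - z)^2/2$ so that the projected primal problem (6) reduces to a linear system; the loss, the data generator, and all sizes are illustrative choices rather than part of the paper.

```python
import numpy as np

def dual_random_projection(X, y, lam, m, rng):
    """Algorithm 1 for the squared loss l(z) = (1 - z)^2 / 2 (an assumption)."""
    d, n = X.shape
    S = rng.standard_normal((d, m))                  # step 2: Gaussian matrix
    X_hat = S.T @ X / np.sqrt(m)                     # step 3: projected data, (m, n)
    # Step 4a: for this loss, problem (6) is solved by
    # (lam*I + X_hat X_hat^T) z = X_hat y.
    z = np.linalg.solve(lam * np.eye(m) + X_hat @ X_hat.T, X_hat @ y)
    # Step 4b: dual variables via Proposition 3, alpha_i = l'(y_i x_hat_i^T z).
    alpha_hat = y * (X_hat.T @ z) - 1.0
    w_tilde = -(X @ (y * alpha_hat)) / lam           # step 5: recovered solution
    w_hat = S @ z / np.sqrt(m)                       # naive solution (10)
    return w_tilde, w_hat

rng = np.random.default_rng(0)
d, n, r, m, lam = 2000, 200, 3, 500, 0.1
# Rank-r data, scaled so that the nonzero singular values are close to 1.
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n)) / np.sqrt(d * n)
y = rng.choice([-1.0, 1.0], size=n)

# Reference solution of (1), computed through the smaller n-by-n system.
w_star = X @ np.linalg.solve(lam * np.eye(n) + X.T @ X, y)

w_tilde, w_hat = dual_random_projection(X, y, lam, m, rng)

def rel(w):
    """Relative distance to the reference solution."""
    return np.linalg.norm(w - w_star) / np.linalg.norm(w_star)

err_recovered, err_naive = rel(w_tilde), rel(w_hat)
print(err_recovered, err_naive)
```

In this setting the recovered solution should land much closer to $w_*$ than the naive one, whose error is dominated by the component lying outside the column space of $X$.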

Algorithm 1 shows the steps of the proposed method. We note that although dual 
variables have been widely used in the analysis of convex optimization (Boyd and Vanden- 
berghe, 2004; Hazan et al., 2011) and online learning (Shalev-Shwartz and Singer, 2006), 
to the best of our knowledge, this is the first time that dual variables have been used in 
conjunction with random projection for recovering optimal solutions. 

To further reduce the recovery error, we develop an iterative method shown in Algorithm 2.
The intuition is that if $\|w_* - \tilde{w}\|_2 \le \epsilon\|w_*\|_2$ with a small $\epsilon$, we can
apply the same dual random projection algorithm again to recover $\Delta w = w_* - \tilde{w}$, which
should result in a recovery error of $\epsilon\|\Delta w\|_2 \le \epsilon^2\|w_*\|_2$. If we repeat the process with $T$ iterations, we
should be able to obtain a solution with a recovery error of $\epsilon^T\|w_*\|_2$. In Algorithm 2, at iteration
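Algorithm 2 An Iterative Dual Random Projection Approach for Recovering Optimal
Solution
1: Input: input patterns $X \in \mathbb{R}^{d \times n}$, binary class assignment $y \in \{-1, +1\}^n$, sample size
$m$, and number of iterations $T$
2: Sample a Gaussian random matrix $S \in \mathbb{R}^{d \times m}$
3: Compute the projected data matrix as $\hat{X} = (\hat{x}_1, \ldots, \hat{x}_n) = S^\top X / \sqrt{m}$
4: Initialize $\tilde{w}_0 = 0$
5: for $t = 1, \ldots, T$ do
6: Obtain $z_*^t \in \mathbb{R}^m$ by solving the following optimization problem

$$z_*^t = \arg\min_{z \in \mathbb{R}^m} \frac{\lambda}{2}\left\|z + S^\top \tilde{w}_{t-1}/\sqrt{m}\right\|_2^2 + \sum_{i=1}^n \ell\left(y_i z^\top \hat{x}_i + y_i \tilde{w}_{t-1}^\top x_i\right) \tag{13}$$

7: Compute the dual solution $\hat{\alpha}_*^t$ using

$$[\hat{\alpha}_*^t]_i = \nabla \ell\left(y_i \hat{x}_i^\top z_*^t + y_i \tilde{w}_{t-1}^\top x_i\right), \; i = 1, \ldots, n$$

8: Update the solution by $\tilde{w}_t = -\sum_{i=1}^n y_i [\hat{\alpha}_*^t]_i x_i / \lambda$
9: end for
10: Output the recovered solution $\tilde{w}_T$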



$t$, given the recovered solution $\tilde{w}_{t-1}$ obtained from the previous iteration, we then solve the
optimization problem in (13) that is designed to recover $w_* - \tilde{w}_{t-1}$.

Remark 5 It is important to note that although Algorithm 2 consists of multiple iterations,
the random projection of the data matrix is only computed once before the start of the
iterations. This important feature makes the iterative algorithm computationally attractive
as calculating random projections of a large data matrix is computationally expensive and has
been the subject of many studies, e.g., (Achlioptas, 2003; Liberty et al., 2008; Braverman
et al., 2010). However, it is worth noting that in Algorithm 2 at each iteration, we need to
compute the dot-product $\tilde{w}_{t-1}^\top x_i$ for all training data in the original space. We also note that
Algorithm 2 is related to the Epoch gradient descent algorithm (Hazan and Kale, 2011) for
stochastic optimization in that the solution obtained in the previous iteration serves as the
center of the optimization problem of the current iteration. Unlike the algorithm in (Hazan
and Kale, 2011), we do not shrink the domain size over the iterations in Algorithm 2.
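
The iterative scheme can be sketched in the same simplified setting. The code below assumes the squared loss $\ell(z) = (1 - z)^2/2$, for which the stationarity condition of (13) is the linear system $(\lambda I + \hat{X}\hat{X}^\top) z = \hat{X}(y - b) - \lambda c$ with $b_i = \tilde{w}_{t-1}^\top x_i$ and $c = S^\top \tilde{w}_{t-1}/\sqrt{m}$. The loss, the synthetic data, and the function name are assumptions made for illustration only.

```python
import numpy as np

def iterative_dual_random_projection(X, y, lam, m, T, rng):
    """Algorithm 2 for the squared loss l(z) = (1 - z)^2 / 2 (an assumption)."""
    d, n = X.shape
    S = rng.standard_normal((d, m))
    X_hat = S.T @ X / np.sqrt(m)                 # projection computed once
    H = lam * np.eye(m) + X_hat @ X_hat.T        # system matrix reused every round
    w = np.zeros(d)                              # initial solution w_0 = 0
    iterates = []
    for _ in range(T):
        b = X.T @ w                              # w_{t-1}^T x_i in the original space
        c = S.T @ w / np.sqrt(m)                 # S^T w_{t-1} / sqrt(m)
        # Stationarity of (13) for this loss: H z = X_hat (y - b) - lam * c.
        z = np.linalg.solve(H, X_hat @ (y - b) - lam * c)
        # Dual variables: l'(y_i x_hat_i^T z + y_i w_{t-1}^T x_i) with l'(u) = u - 1.
        alpha = y * (X_hat.T @ z + b) - 1.0
        w = -(X @ (y * alpha)) / lam             # update of step 8
        iterates.append(w)
    return iterates

rng = np.random.default_rng(0)
d, n, r, m, lam, T = 2000, 200, 3, 500, 0.1, 5
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n)) / np.sqrt(d * n)
y = rng.choice([-1.0, 1.0], size=n)
w_star = X @ np.linalg.solve(lam * np.eye(n) + X.T @ X, y)   # reference solution

errors = [np.linalg.norm(w - w_star) / np.linalg.norm(w_star)
          for w in iterative_dual_random_projection(X, y, lam, m, T, rng)]
print(errors)   # relative errors after iterations 1, ..., T
```

The first iterate coincides with the output of Algorithm 1, and the later iterates reuse the same projected matrix, so the per-iteration cost is dominated by the products with $X$ in the original space.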



4. Main Results 

In this section, we will present a bound of the recovery error $\|w_* - \tilde{w}\|_2$ for the dual random
projection algorithm. We will then extend the result to the iterative algorithm. Similar to
compressive sensing, we need to assume certain sparse structure for the recovery problem.
In our case, we assume that the data matrix $X$ is of low rank. We note that the low rank
assumption is closely related to the sparsity assumption made in compressive sensing. This
is because $w_*$ lies in the subspace spanned by the column vectors of $X$ and the low rank
assumption of $X$ directly implies that $w_*$ is sparse with respect to the eigen system of $X$.

We denote by $r$ the rank of matrix $X$. The following theorem shows that the recovery
error of Algorithm 1 is small provided that (1) $X$ is of low rank (i.e., $r \ll \min(d, n)$), and
(2) the number of random projections is sufficiently large.

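Theorem 2 Let $w_*$ be the optimal solution to (1) and let $\tilde{w}$ be the solution recovered by
Algorithm 1. Then, with a probability at least $1 - \delta$, we have

$$\|\tilde{w} - w_*\|_2 \le \frac{2\epsilon}{1 - \epsilon}\|w_*\|_2$$

provided

$$m \ge \frac{r\left(\log(r^2 + r) + \log(1/\delta)\right)}{c\epsilon^2}$$

where constant $c$ is at least 1/32.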



Remark 6 According to Theorem 2, the number of required random projections is $\Omega(r \log r)$.
This is similar to the compressive sensing result if we view rank $r$ as the sparsity measure used in
compressive sensing. Following the same arguments as compressive sensing, it may be possible
to argue that $\Omega(r \log r)$ is optimal due to the result of the coupon collector's problem (Motwani
and Raghavan, 1995), although the rigorous analysis remains to be developed.

As a comparison, the following theorem shows that with a high probability, the naive
solution $\hat{w}$ given in (10) (i.e., the solution based on random projection without exploiting
the dual variables) does not accurately recover the true optimal solution $w_*$.

Theorem 3 With a probability $1 - \exp(-(d - r)/32) - \exp(-m/32) - \delta$, we have

$$\|\hat{w} - w_*\|_2 \ge \sqrt{\frac{d - r}{m}}\left(\frac{1}{2} - \frac{\epsilon\sqrt{2(1 + \epsilon)}}{1 - \epsilon}\right)\|w_*\|_2$$

provided

$$m \ge \frac{r\left(\log(r^2 + r) + \log(1/\delta)\right)}{c\epsilon^2}$$

Remark 7 As indicated by Theorem 3, when $m$ is sufficiently larger than $r$ but significantly
smaller than $d$, we have $\|\hat{w} - w_*\|_2 = \Omega(\sqrt{d/m}\,\|w_*\|_2)$, indicating that $\hat{w}$ does not
approximate $w_*$ well.

It is important to note that Theorem 3 does not contradict the previous results
showing that the random projection method could result in a small classification error if
the data set is almost linearly separable with a large margin. This is because, to decide if
$\hat{w}$ carries similar classification performance as $w_*$, we need to measure the following term

$$\max_{x \in \mathrm{span}(X),\, \|x\|_2 \le 1} |x^\top(\hat{w} - w_*)| \tag{14}$$

Since $\|\hat{w} - w_*\|_2$ can also be written as

$$\|\hat{w} - w_*\|_2 = \max_{\|x\|_2 \le 1} x^\top(\hat{w} - w_*),$$

the quantity defined in (14) could be significantly smaller than $\|\hat{w} - w_*\|_2$ if data matrix
$X$ is of low rank. The following theorem quantifies this statement.

Theorem 4 With a probability at least $1 - \delta$, we have

$$\max_{x \in \mathrm{span}(X),\, \|x\|_2 \le 1} x^\top(w_* - \hat{w}) \le \epsilon\left(1 + \frac{2}{1 - \epsilon}\right)\|w_*\|_2$$

provided

$$m \ge \frac{r\left(\log(r^2 + r) + \log(1/\delta)\right)}{c\epsilon^2}$$

where constant $c$ is at least 1/32.



The proof of Theorem 4 can be found in the Appendix. We note that Theorem 4 directly implies
the result of margin classification error for random projection (Blum, 2006). This is because
when a data point $(x_i, y_i)$ can be separated by $w_*$ with a margin $\gamma$, i.e. $y_i w_*^\top x_i \ge \gamma\|w_*\|_2$, it
will be classified by $\hat{w}$ with a margin at least $\gamma - \left(1 + \frac{2}{1 - \epsilon}\right)\epsilon$ provided $\gamma > \left(1 + \frac{2}{1 - \epsilon}\right)\epsilon$.

Using Theorem 2, we now state the recovery result for the iterative method in Algorithm 2.



Theorem 5 Let $w_*$ be the optimal solution to (1) and let $\tilde{w}_T$ be the solution recovered by
Algorithm 2. Then, with a probability at least $1 - \delta$, we have

$$\|w_* - \tilde{w}_T\|_2 \le \left(\frac{2\epsilon}{1 - \epsilon}\right)^T \|w_*\|_2$$

provided

$$m \ge \frac{r\left(\log(r^2 + r) + \log(T/\delta)\right)}{c\epsilon^2}$$

where constant $c$ is at least 1/32.
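
To get a feel for the quantities in Theorems 2 and 5, the sample-size condition and the error factor can be evaluated directly. The sketch below assumes the condition has the form $m \ge r(\log(r^2 + r) + \log(T/\delta))/(c\epsilon^2)$, which is the form suggested by the proof of Corollary 6 in Appendix B, and takes $c = 1/32$; the function names are hypothetical.

```python
import math

def required_projections(r, delta, eps, c=1.0 / 32, T=1):
    """Smallest integer m with m >= r (log(r^2 + r) + log(T / delta)) / (c eps^2)."""
    bound = r * (math.log(r * r + r) + math.log(T / delta)) / (c * eps ** 2)
    return math.ceil(bound)

def recovery_factor(eps, T=1):
    """Factor (2 eps / (1 - eps))^T that multiplies ||w_*||_2 in the bounds."""
    return (2 * eps / (1 - eps)) ** T

m_single = required_projections(r=10, delta=0.01, eps=0.1)        # Theorem 2
m_iter = required_projections(r=10, delta=0.01, eps=0.1, T=3)     # Theorem 5
print(m_single, m_iter, recovery_factor(0.1), recovery_factor(0.1, T=3))
```

The iterative bound only pays a $\log T$ term in the number of projections while the error factor shrinks geometrically in $T$. The worst-case constants make the required $m$ large, so these numbers are best read as scaling guidance rather than practical settings.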



5. Analysis 

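Before presenting the analysis, we first establish some notations and facts. Let the SVD of
$X$ be

$$X = U \Sigma V^\top = \sum_{i=1}^r \lambda_i u_i v_i^\top$$

where $\lambda_i$ is the $i$th singular value of $X$, and $u_i$ and $v_i$ are the corresponding left and right
singular vectors of $X$. Let $U = (u_1, \ldots, u_r)$, $V = (v_1, \ldots, v_r)$, and $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$.
Using the singular value decomposition of $X$, we define

$$\gamma_* = \Sigma V^\top D(y) \alpha_*, \quad \tilde{\gamma} = \Sigma V^\top D(y) \hat{\alpha}_*$$

It is straightforward to show that

$$w_* = -\frac{1}{\lambda} U \gamma_*, \quad \tilde{w} = -\frac{1}{\lambda} U \tilde{\gamma}$$

Since the columns of $U$ are orthonormal, we have

$$\|w_*\|_2 = \frac{1}{\lambda}\|\gamma_*\|_2, \quad \|\tilde{w}\|_2 = \frac{1}{\lambda}\|\tilde{\gamma}\|_2$$

Let us define $A = U^\top S$. It is easy to verify that $A$ is a Gaussian matrix of size
$r \times m$.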
r X m. 

5.1. Proof of Theorem 2 

The key to our analysis is to show that $\hat{G}$ in (7) is close to $G$ in (2) when the number of
random projections is sufficiently large. To this end, we need the following concentration
inequality for Gaussian random matrices.

Corollary 6 Let $M \in \mathbb{R}^{r \times m}$ be a standard Gaussian random matrix. Then, with a probability
at least $1 - \delta$, we have

$$\left\|\frac{1}{m} M M^\top - I\right\|_2 \le \epsilon$$

provided that

$$m \ge \frac{r\left(\log(r^2 + r) + \log(1/\delta)\right)}{c\epsilon^2}$$

where $\|\cdot\|_2$ is the spectral norm of a matrix and $c$ is a constant whose value is at least
1/32.

Remark 8 The above corollary serves as the key to our analysis, which enables us to bound
$\hat{G} - G$ and furthermore to bound $\alpha_* - \hat{\alpha}_*$.
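
The concentration described by Corollary 6 is easy to observe empirically. The sketch below draws a few Gaussian matrices and reports the spectral norm of $M M^\top / m - I$; the rank, the sample sizes, and the helper name are illustrative assumptions, and the experiment says nothing about the constant $c$.

```python
import numpy as np

def spectral_deviation(r, m, rng):
    """Return || M M^T / m - I ||_2 for an r-by-m standard Gaussian matrix M."""
    M = rng.standard_normal((r, m))
    return float(np.linalg.norm(M @ M.T / m - np.eye(r), ord=2))

rng = np.random.default_rng(0)
r = 5
sizes = [50, 500, 5000]
deviations = [spectral_deviation(r, m, rng) for m in sizes]
print(dict(zip(sizes, deviations)))   # deviation for each number of projections m
```

The deviation decays roughly like $\sqrt{r/m}$, which is why the number of projections only needs to grow with the rank $r$ and not with the ambient dimension $d$.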

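Using Corollary 6, we have the following theorem that bounds the difference between $\hat{G}$
and $G$.

Theorem 7 With a probability $1 - \delta$, we have

$$(1 + \epsilon) G \succeq \hat{G} \succeq (1 - \epsilon) G$$

provided

$$m \ge \frac{r\left(\log(r^2 + r) + \log(1/\delta)\right)}{c\epsilon^2}$$

Proof We rewrite $G$ and $\hat{G}$ as

$$G = D(y) V \Sigma U^\top U \Sigma V^\top D(y)$$

$$\hat{G} = D(y) V \Sigma U^\top \frac{S S^\top}{m} U \Sigma V^\top D(y) = D(y) V \Sigma \frac{A A^\top}{m} \Sigma V^\top D(y)$$

Then with a probability $1 - \delta$ under the given condition on $m$, we can show that

$$\hat{G} - (1 + \epsilon) G = D(y) V \Sigma \left(\frac{A A^\top}{m} - (1 + \epsilon) I\right) \Sigma V^\top D(y) \preceq 0$$

and

$$\hat{G} - (1 - \epsilon) G = D(y) V \Sigma \left(\frac{A A^\top}{m} - (1 - \epsilon) I\right) \Sigma V^\top D(y) \succeq 0$$

using the result in Corollary 6 since $A$ is a Gaussian matrix of size $r \times m$. ■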

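We now give the proof for Theorem 2. The basic logic is straightforward. Since $\hat{G}$ is
close to $G$, we would expect $\hat{\alpha}_*$, the optimal solution to (7), to be close to $\alpha_*$, the optimal
solution to (2). Since $w_* = -X D(y) \alpha_* / \lambda$ and $\tilde{w} = -X D(y) \hat{\alpha}_* / \lambda$, we would then expect $\tilde{w}$ to
be close to $w_*$.

Proof [Theorem 2] Define $L(\alpha)$ and $\hat{L}(\alpha)$ as

$$L(\alpha) = -\sum_{i=1}^n \ell_*(\alpha_i) - \frac{1}{2\lambda}\alpha^\top G \alpha, \quad \hat{L}(\alpha) = -\sum_{i=1}^n \ell_*(\alpha_i) - \frac{1}{2\lambda}\alpha^\top \hat{G} \alpha$$

Since $\hat{\alpha}_*$ maximizes $\hat{L}(\alpha)$ over the domain $\Omega^n$, we have

$$\hat{L}(\hat{\alpha}_*) \ge \hat{L}(\alpha_*) + \frac{1}{2\lambda}(\hat{\alpha}_* - \alpha_*)^\top \hat{G} (\hat{\alpha}_* - \alpha_*) \tag{15}$$

Using the concaveness of $\hat{L}(\alpha)$, we have

$$\begin{aligned}
\hat{L}(\hat{\alpha}_*) &\le \hat{L}(\alpha_*) + (\hat{\alpha}_* - \alpha_*)^\top \nabla \hat{L}(\alpha_*) \\
&= \hat{L}(\alpha_*) + (\hat{\alpha}_* - \alpha_*)^\top \left(\nabla \hat{L}(\alpha_*) - \nabla L(\alpha_*) + \nabla L(\alpha_*)\right) \\
&\le \hat{L}(\alpha_*) + \frac{1}{\lambda}(\hat{\alpha}_* - \alpha_*)^\top (G - \hat{G}) \alpha_*
\end{aligned} \tag{16}$$

where the last inequality follows from the fact that $(\hat{\alpha}_* - \alpha_*)^\top \nabla L(\alpha_*) \le 0$ since $\alpha_*$ maximizes
$L(\alpha)$ over the domain $\Omega^n$. Combining the inequalities in (15) and (16), we have

$$\frac{1}{\lambda}(\hat{\alpha}_* - \alpha_*)^\top (G - \hat{G}) \alpha_* \ge \frac{1}{2\lambda}(\hat{\alpha}_* - \alpha_*)^\top \hat{G} (\hat{\alpha}_* - \alpha_*)$$

Therefore, writing $G$ and $\hat{G}$ in terms of $\Sigma$, $V$, and $A$ and using the definitions of $\gamma_*$ and $\tilde{\gamma}$,

$$(\tilde{\gamma} - \gamma_*)^\top \left(I - \frac{A A^\top}{m}\right) \gamma_* \ge \frac{1}{2}(\tilde{\gamma} - \gamma_*)^\top \frac{A A^\top}{m} (\tilde{\gamma} - \gamma_*) \tag{17}$$

Using Corollary 6, with a probability $1 - \delta$, we have $\|I - A A^\top / m\|_2 \le \epsilon$. The left side of (17)
is then at most $\epsilon\|\tilde{\gamma} - \gamma_*\|_2\|\gamma_*\|_2$ and the right side is at least $\frac{1 - \epsilon}{2}\|\tilde{\gamma} - \gamma_*\|_2^2$, and therefore

$$(1 - \epsilon)\|\tilde{\gamma} - \gamma_*\|_2 \le 2\epsilon\|\gamma_*\|_2$$

We complete the proof using the fact that

$$w_* = -\frac{1}{\lambda} U \gamma_*, \quad \tilde{w} = -\frac{1}{\lambda} U \tilde{\gamma}.$$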
5.2. Proof of Theorem 3

As indicated before, the key reason for the large difference between $\hat{w}$ and $w_*$ is that
they do not lie in the same subspace: $w_*$ lies in the subspace spanned by the columns in
$U$ while $\hat{w}$ lies in the subspace spanned by the column vectors in a random matrix. Before
presenting our analysis, we first state a version of the Johnson-Lindenstrauss theorem that is useful
to our analysis.

Theorem 8 (Theorem 2 (Blum, 2006)) Let $x \in \mathbb{R}^d$, and $\hat{x} = S^\top x / \sqrt{m}$, where $S \in \mathbb{R}^{d \times m}$
is a random matrix whose entries are chosen independently from $\mathcal{N}(0, 1)$. Then

$$\Pr\left\{(1 - \epsilon)\|x\|_2^2 \le \|\hat{x}\|_2^2 \le (1 + \epsilon)\|x\|_2^2\right\} \ge 1 - 2\exp\left(-\frac{m}{4}(\epsilon^2 - \epsilon^3)\right)$$

In the subspace orthogonal to $u_1, \ldots, u_r$, we randomly choose a subset of $d - r$ orthogonal
bases, denoted by $u_{r+1}, \ldots, u_d$. Let $U_\perp = (u_{r+1}, \ldots, u_d)$. Since

$$\|w_* - \hat{w}\|_2 = \max_{\|x\|_2 \le 1} x^\top(w_* - \hat{w}),$$

to facilitate our analysis, we restrict the choice of $x$ to the subspace $\mathrm{span}(u_{r+1}, \ldots, u_d)$ and
have

$$\|w_* - \hat{w}\|_2 \ge \max_{x \in \mathrm{span}(u_{r+1}, \ldots, u_d),\, \|x\|_2 \le 1} x^\top \hat{w}$$

where we use the fact $w_* \perp \mathrm{span}(u_{r+1}, \ldots, u_d)$. Write $x$ as $x = U_\perp a$, where $a \in \mathbb{R}^{d - r}$.
Define

$$\hat{A} = U_\perp^\top S \in \mathbb{R}^{(d - r) \times m}$$

As a result, we bound $\|w_* - \hat{w}\|_2$ by

$$\max_{x \in \mathrm{span}(u_{r+1}, \ldots, u_d),\, \|x\|_2 \le 1} x^\top \hat{w} = \max_{\|a\|_2 \le 1} \frac{1}{m\lambda} a^\top U_\perp^\top S S^\top U \tilde{\gamma} = \frac{1}{m\lambda}\|\hat{A} A^\top \tilde{\gamma}\|_2 \tag{18}$$

where $\tilde{\gamma}$ is given by

$$\tilde{\gamma} = \Sigma V^\top D(y) \hat{\alpha}_*$$

It is easy to verify that $\hat{A}$ and $A$ are two independent Gaussian random matrices. Therefore,
we can fix the vector $A^\top \tilde{\gamma}$ and estimate how the random matrix $\hat{A}$ affects the norm of
vector $\hat{A} A^\top \tilde{\gamma}$. According to the Johnson-Lindenstrauss Theorem (i.e., Theorem 8), for a fixed
vector $A^\top \tilde{\gamma}$, with a probability $1 - \exp(-(\epsilon^2 - \epsilon^3)(d - r)/4)$, we have

$$\frac{1}{\sqrt{d - r}}\|\hat{A} A^\top \tilde{\gamma}\|_2 \ge \sqrt{1 - \epsilon}\,\|A^\top \tilde{\gamma}\|_2$$

By choosing $\epsilon = 1/2$, we have, with a probability $1 - \exp(-(d - r)/32)$,

$$\frac{1}{\sqrt{d - r}}\|\hat{A} A^\top \tilde{\gamma}\|_2 \ge \frac{1}{\sqrt{2}}\|A^\top \tilde{\gamma}\|_2 \tag{19}$$

We now bound $\|A^\top \tilde{\gamma}\|_2$. Note that we cannot directly apply the Johnson-Lindenstrauss
Theorem to bound the length of $A^\top \tilde{\gamma}$ because $\tilde{\gamma}$ is a random variable depending on the
random matrix $A$. To decouple the dependence between $A$ and $\tilde{\gamma}$, we expand $\|A^\top \tilde{\gamma}\|_2$ as

$$\|A^\top \tilde{\gamma}\|_2 \ge \|A^\top \gamma_*\|_2 - \|A^\top(\gamma_* - \tilde{\gamma})\|_2 \tag{20}$$
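where $\gamma_* = \Sigma V^\top D(y) \alpha_*$.

We bound the two terms on the right side of the inequality in (20) separately. Using the
Johnson-Lindenstrauss Theorem, with a probability $1 - \exp(-m/32)$, we bound $\|A^\top \gamma_*\|_2$ by

$$\frac{1}{\sqrt{m}}\|A^\top \gamma_*\|_2 \ge \frac{1}{\sqrt{2}}\|\gamma_*\|_2 = \frac{\lambda}{\sqrt{2}}\|w_*\|_2 \tag{21}$$

To bound the second term $\|A^\top(\gamma_* - \tilde{\gamma})\|_2$, with a probability $1 - \delta$, we have

$$\frac{1}{\sqrt{m}}\|A^\top(\gamma_* - \tilde{\gamma})\|_2 \le \sqrt{\lambda_{\max}(A A^\top / m)}\,\|\gamma_* - \tilde{\gamma}\|_2 \le \sqrt{1 + \epsilon}\,\lambda\|w_* - \tilde{w}\|_2$$

where we use the result in Corollary 6. According to Theorem 2, with a probability $1 - \delta$,
we have

$$\|w_* - \tilde{w}\|_2 \le \frac{2\epsilon}{1 - \epsilon}\|w_*\|_2$$

As a result, with probability $1 - \delta$, we have

$$\frac{1}{\sqrt{m}}\|A^\top(\gamma_* - \tilde{\gamma})\|_2 \le \lambda\frac{2\epsilon\sqrt{1 + \epsilon}}{1 - \epsilon}\|w_*\|_2 \tag{22}$$

We complete the proof by putting together (18), (19), (20), (21), and (22).
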
5.3. Proof of Theorem 5 

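Given a solution $\tilde{w}_t$ obtained at iteration $t$, we consider the following optimization problem

$$\min_{w \in \mathbb{R}^d} L_t(w; X, y) = \frac{\lambda}{2}\|w + \tilde{w}_t\|_2^2 + \sum_{i=1}^n \ell\left(y_i (w + \tilde{w}_t)^\top x_i\right) \tag{23}$$

It is straightforward to show that $\Delta_*^{t+1} = w_* - \tilde{w}_t$ is the optimal solution to (23). Then we
can use the dual random projection approach to recover $\Delta_*^{t+1}$ by $\tilde{\Delta}^{t+1}$. If we can similarly
show that

$$\|\tilde{\Delta}^{t+1} - \Delta_*^{t+1}\|_2 \le \frac{2\epsilon}{1 - \epsilon}\|\Delta_*^{t+1}\|_2$$

then we define the updated recovered solution by $\tilde{w}_{t+1} = \tilde{w}_t + \tilde{\Delta}^{t+1}$ and have

$$\|\tilde{w}_{t+1} - w_*\|_2 = \|\tilde{\Delta}^{t+1} - \Delta_*^{t+1}\|_2 \le \frac{2\epsilon}{1 - \epsilon}\|\Delta_*^{t+1}\|_2 = \frac{2\epsilon}{1 - \epsilon}\|\tilde{w}_t - w_*\|_2$$

Continuing in this way, if we repeat the above process for $t = 1, \ldots, T$, the recovery error of $\tilde{w}_T$ is
given by

$$\|\tilde{w}_T - w_*\|_2 \le \left(\frac{2\epsilon}{1 - \epsilon}\right)^{T-1}\|\tilde{w}_1 - w_*\|_2 \le \left(\frac{2\epsilon}{1 - \epsilon}\right)^T\|w_*\|_2$$

The remaining question is how to compute $\tilde{\Delta}^{t+1}$ using the dual random projection
approach. In order to make the previous analysis remain valid for the recovered solution
$\tilde{\Delta}^{t+1}$ to the problem (23), we need to write the primal optimization problem in the same
form as in (1). To this end, we first note that $\tilde{w}_t$ lies in the subspace spanned by $x_1, \ldots, x_n$,
thus we write $\tilde{w}_t$ as

$$\tilde{w}_t = -\frac{1}{\lambda} X D(y) \hat{\alpha}_*^t = -\frac{1}{\lambda}\sum_{i=1}^n [\hat{\alpha}_*^t]_i y_i x_i$$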



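Thus, $L_t(w; X, y)$ can be written as

$$\begin{aligned}
L_t(w; X, y) &= \frac{\lambda}{2}\|\tilde{w}_t\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + \lambda w^\top \tilde{w}_t + \sum_{i=1}^n \ell\left(y_i w^\top x_i + y_i \tilde{w}_t^\top x_i\right) \\
&= \frac{\lambda}{2}\|\tilde{w}_t\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + \sum_{i=1}^n \left[\ell\left(y_i w^\top x_i + y_i \tilde{w}_t^\top x_i\right) - [\hat{\alpha}_*^t]_i y_i w^\top x_i\right] \\
&= \frac{\lambda}{2}\|\tilde{w}_t\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + \sum_{i=1}^n \ell_i^t(y_i w^\top x_i)
\end{aligned}$$

where the new loss function $\ell_i^t(z), i = 1, \ldots, n$ is defined as

$$\ell_i^t(z) = \ell(z + y_i \tilde{w}_t^\top x_i) - [\hat{\alpha}_*^t]_i z \tag{24}$$

Therefore $\Delta_*^{t+1}$ is the solution to the following problem

$$\Delta_*^{t+1} = \arg\min_{w \in \mathbb{R}^d} \frac{\lambda}{2}\|w\|_2^2 + \sum_{i=1}^n \ell_i^t(y_i w^\top x_i)$$

To apply the dual random projection approach to recover $\Delta_*^{t+1}$, we solve the following
optimization problem in the projected space:

$$\min_{z \in \mathbb{R}^m} \frac{\lambda}{2}\|z\|_2^2 + \sum_{i=1}^n \ell_i^t(y_i z^\top \hat{x}_i)$$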

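The following derivation shows that the above problem is equivalent to the problem in (13).

$$\begin{aligned}
&\min_{z \in \mathbb{R}^m} \frac{\lambda}{2}\|z\|_2^2 + \sum_{i=1}^n \ell_i^t(y_i z^\top \hat{x}_i) \\
&= \frac{\lambda}{2}\|z\|_2^2 + \sum_{i=1}^n \left[\ell\left(y_i z^\top \hat{x}_i + y_i \tilde{w}_t^\top x_i\right) - [\hat{\alpha}_*^t]_i y_i z^\top \hat{x}_i\right] \\
&= \frac{\lambda}{2}\|z\|_2^2 + \frac{\lambda}{\sqrt{m}} z^\top (S^\top \tilde{w}_t) + \sum_{i=1}^n \ell\left(y_i z^\top \hat{x}_i + y_i \tilde{w}_t^\top x_i\right) \\
&= \frac{\lambda}{2}\left\|z + S^\top \tilde{w}_t / \sqrt{m}\right\|_2^2 + \sum_{i=1}^n \ell\left(y_i z^\top \hat{x}_i + y_i \tilde{w}_t^\top x_i\right) - \frac{\lambda}{2}\left\|S^\top \tilde{w}_t / \sqrt{m}\right\|_2^2
\end{aligned}$$

where we use $\tilde{w}_t = -\frac{1}{\lambda}\sum_{i=1}^n [\hat{\alpha}_*^t]_i y_i x_i$. Given the optimal solution $z_*^{t+1}$ to the above problem,
we can recover $\Delta_*^{t+1}$ by

$$\tilde{\Delta}^{t+1} = -\frac{1}{\lambda} X D(y) \tilde{\alpha}^{t+1}$$

where $\tilde{\alpha}^{t+1}$ is computed by

$$[\tilde{\alpha}^{t+1}]_i = \nabla \ell_i^t\left(y_i \hat{x}_i^\top z_*^{t+1}\right) = \nabla \ell\left(y_i \hat{x}_i^\top z_*^{t+1} + y_i \tilde{w}_t^\top x_i\right) - [\hat{\alpha}_*^t]_i$$

The updated solution $\tilde{w}_{t+1}$ is computed by

$$\tilde{w}_{t+1} = \tilde{w}_t + \tilde{\Delta}^{t+1} = -\frac{1}{\lambda}\sum_{i=1}^n [\hat{\alpha}_*^{t+1}]_i y_i x_i$$

where $[\hat{\alpha}_*^{t+1}]_i = [\tilde{\alpha}^{t+1}]_i + [\hat{\alpha}_*^t]_i = \nabla \ell\left(y_i \hat{x}_i^\top z_*^{t+1} + y_i \tilde{w}_t^\top x_i\right)$.
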
6. Conclusion 

In this paper, we discuss the problem of recovering optimal solutions through random
projection. Our goal is to first efficiently obtain an approximate solution $z_*$ for a given
optimization problem by using random projection and then reconstruct the true optimal
solution $w_*$ from the random projection based solution $z_*$. We develop a dual random
projection approach and show that under the assumption that the data matrix $X$ is of low
rank, the proposed approach is able to accurately recover the true optimal solution $w_*$ with
a small error.

There are several open questions that need to be addressed in the future. The first 
open question is to analyze the behavior of the proposed algorithm when X can be well 
approximated by a low rank matrix, an assumption that is significantly weaker than the low 
rank assumption. The second open question is to develop a parallel version of the proposed 
algorithm by running it independently over multiple machines. The challenge is to design 
an effective approach for combining multiple sets of random projection based solutions into 
one solution with significantly smaller recovering error. This is an important question when 
we need to adapt the proposed algorithm to a distributed computing environment. 
Appendix A. Proof of Proposition 1 and Proposition 3

Since the two propositions can be proved similarly, we only present the proof of Proposition 1.
First, if $\alpha_*$ is the optimal dual solution, the optimal primal solution can be solved
by

$$w_* = \arg\min_{w \in \mathbb{R}^d} \frac{\lambda}{2}\|w\|_2^2 + \sum_{i=1}^n [\alpha_*]_i y_i x_i^\top w$$

By setting the gradient with respect to $w$ to zero, we obtain $w_* = -\sum_{i=1}^n [\alpha_*]_i y_i x_i / \lambda =
-X D(y) \alpha_* / \lambda$.

Second, to prove the dual solution $\alpha_*$ given the primal solution $w_*$, we note that

$$[\alpha_*]_i = \arg\max_{\alpha \in \Omega} \alpha\, y_i x_i^\top w_* - \ell_*(\alpha)$$

By the Fenchel conjugate theory (e.g., Theorem 11.4 in (Cesa-Bianchi and Lugosi, 2006)),
we have $\alpha_*$ satisfying

$$[\alpha_*]_i = \nabla \ell(y_i x_i^\top w_*).$$

Appendix B. Proof of Corollary 6 

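In the proof, we make use of the following concentration inequality regarding the eigenvalues
of Gaussian random matrices.

Theorem 9 (Corollary 7.2 (Gittens and Tropp, 2011)) Let $C \in \mathbb{R}^{p \times p}$ be a positive definite
matrix. Let $\eta_i \in \mathbb{R}^p, i = 1, \ldots, n$ be i.i.d. samples drawn from a $\mathcal{N}(0, C)$ distribution.
Define

$$\hat{C}_n = \frac{1}{n}\sum_{i=1}^n \eta_i \eta_i^\top$$

Write $\lambda_k$ for the $k$th eigenvalue of $C$, and write $\hat{\lambda}_k$ for the $k$th eigenvalue of $\hat{C}_n$. Then, for
$k = 1, \ldots, p$,

$$\Pr\left\{\hat{\lambda}_k \ge (1 + \epsilon)\lambda_k\right\} \le (p - k + 1)\exp\left(-\frac{c n \epsilon^2}{\sum_{i=k}^p \lambda_i / \lambda_k}\right), \quad \text{for } \epsilon \le 4n$$

and

$$\Pr\left\{\hat{\lambda}_k \le (1 - \epsilon)\lambda_k\right\} \le k\exp\left(-\frac{c n \epsilon^2}{\sum_{i=1}^k \lambda_1 \lambda_i / \lambda_k^2}\right), \quad \text{for } \epsilon \in (0, 1]$$

where constant $c$ is at least 1/32.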

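We write $M = (\eta_1, \ldots, \eta_m)$, where $\eta_i \in \mathbb{R}^r$ is an i.i.d. sample from a Gaussian distribution
$\mathcal{N}(0, I)$, and write $M M^\top / m$ as

$$\frac{1}{m} M M^\top = \frac{1}{m}\sum_{i=1}^m \eta_i \eta_i^\top = \hat{C}_m$$

Using Theorem 9, we have

$$1 - \epsilon \le \lambda_k(\hat{C}_m) \le 1 + \epsilon, \; k = 1, \ldots, r$$

with the failure probability at most

$$\sum_{k=1}^r \left[(r - k + 1)\exp\left(-\frac{c m \epsilon^2}{r}\right) + k\exp\left(-\frac{c m \epsilon^2}{r}\right)\right] = (r^2 + r)\exp\left(-\frac{c m \epsilon^2}{r}\right)$$

We complete the proof by setting the above failure probability to be less than $\delta$.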

Appendix C. Proof of Theorem 4 

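We write $\hat{w}$ in terms of $\tilde{w}$ as $\hat{w} = S S^\top \tilde{w} / m$ and therefore

$$\begin{aligned}
\max_{\|x\|_2 \le 1,\, x \in \mathrm{span}(X)} x^\top(w_* - \hat{w}) &\le \|w_* - \tilde{w}\|_2 + \max_{\|x\|_2 \le 1,\, x \in \mathrm{span}(X)} x^\top(\tilde{w} - \hat{w}) \\
&= \|w_* - \tilde{w}\|_2 + \max_{\|a\|_2 \le 1} a^\top\left(I - \frac{1}{m} U^\top S S^\top U\right)\tilde{\gamma} / \lambda \\
&\le \|w_* - \tilde{w}\|_2 + \lambda_{\max}\left(I - \frac{1}{m} U^\top S S^\top U\right)\|\tilde{w}\|_2 \\
&\le \|w_* - \tilde{w}\|_2 + \lambda_{\max}\left(I - \frac{1}{m} A A^\top\right)\|w_*\|_2
\end{aligned}$$

The last but one inequality uses the fact $\|\tilde{w}\|_2 = \|\tilde{\gamma}\|_2 / \lambda$. Using Corollary 6, we have, with
a probability $1 - \delta$,

$$\lambda_{\max}\left(I - \frac{1}{m} A A^\top\right) \le \epsilon$$

We complete the proof by using the bound for $\|w_* - \tilde{w}\|_2$ stated in Theorem 2.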

Appendix D. Proof of Theorem 1 

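According to (17) in the proof of Theorem 2, we have

$$2(\tilde{\gamma} - \gamma_*)^\top\left(I - \frac{A A^\top}{m}\right)\gamma_* \ge (\tilde{\gamma} - \gamma_*)^\top \frac{A A^\top}{m}(\tilde{\gamma} - \gamma_*)$$

Using the fact $\sqrt{m} z_* = -A^\top \tilde{\gamma} / \lambda$ and $S^\top w_* = -A^\top \gamma_* / \lambda$, we have

$$\frac{\lambda^2}{m}\|\sqrt{m} z_* - S^\top w_*\|_2^2 \le 2(\tilde{\gamma} - \gamma_*)^\top\left(I - \frac{A A^\top}{m}\right)\gamma_*$$

Using Corollary 6, with a probability $1 - \delta$, we have

$$\frac{1}{m}\|\sqrt{m} z_* - S^\top w_*\|_2^2 \le 2\epsilon\|w_*\|_2\|\tilde{w} - w_*\|_2$$

Using Theorem 2, with a probability $1 - \delta$, we have

$$\frac{1}{m}\|\sqrt{m} z_* - S^\top w_*\|_2^2 \le \frac{4\epsilon^2}{1 - \epsilon}\|w_*\|_2^2 \tag{25}$$

To replace $w_*$ on the right-hand side of the above inequality with $S^\top w_*$, we make use of Theorem 8.
As a result, with a probability $1 - \exp(-m/32)$, we have

$$\frac{1}{m}\|S^\top w_*\|_2^2 \ge \frac{1}{2}\|w_*\|_2^2 \tag{26}$$

We complete the proof by combining the two inequalities in (25) and (26).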